Here, we will do a sample workflow for two samples at different stages in zebrafish development: 512 cell stage (~2,7 hours hpf) and Prim 6 (~24 hpf). These stages have been previously published by Nepal et al (2013) and are available in an R package (by Haberle). Similar packages are available for mice and human data.
For the first part of the workflow we will use the R package called “CAGEr”. We will do some of the standard workflow and gradually manipulate the data ourselves in R.
So let’s start by loading in the package for the developmental stages as well as CAGEr:
# packages
require(ZebrafishDevelopmentalCAGE)
require(CAGEr)
Loading the samples of which we have data and display which stages we have here:
# loading data
data(ZebrafishSamples)
head(ZebrafishSamples)
# samples
samples <- as.character(ZebrafishSamples$sample) # to see all the samples
samples
[1] "zf_unfertilized_egg" "zf_fertilized_egg" "zf_64cells"
[4] "zf_512cells" "zf_high" "zf_oblong"
[7] "zf_sphere_dome" "zf_30perc_dome" "zf_shield"
[10] "zf_14somites" "zf_prim6" "zf_prim20"
As said earlier, we will pick just two developmental stages: “zf_512cells” and “zf_prim6”. In the vignette (and manual) the way of loading in these data is by the function importPublicData(). We need to specify the source, dataset, group, and samples. The source is “ZebrafishDevelopment” and the rest are found in the dataframe of the ZebrafishSamples: ZebrafishCAGE, development, samplenames. The avoid spelling everything out we’ll use the vector of samples:
# creating a CAGEset object:
myCAGEset <- importPublicData(source = "ZebrafishDevelopment", dataset = "ZebrafishCAGE",
group = "development", sample = samples[c(4,11)] )
# what does it look like?
myCAGEset
S4 Object of class CAGEset
=======================================
Input data information
=======================================
Reference genome (organism): BSgenome.Drerio.UCSC.danRer7
Input file type: ZebrafishDevelopment
Input file names: ZebrafishDevelopment__zf_512cells, ZebrafishDevelopment__zf_prim6
Sample labels: zf_512cells, zf_prim6
=======================================
CTSS information
=======================================
CTSS chromosome: chr1, chr1, chr1, ...
CTSS position: 3783, 3788, 3798, ...
CTSS strand: +, +, +, ...
Tag count:
-> zf_512cells: 0, 0, 0, ...
-> zf_prim6: 1, 1, 1, ...
Normalized tpm:
=======================================
Tag cluster (TC) information
=======================================
CTSS clustering method:
Number of TCs per sample:
=======================================
Consensus cluster information
=======================================
Number of consensus clusters:
Consensus cluster chromosome:
Consensus cluster start:
Consensus cluster end:
Consensus cluster strand:
Normalized tpm:
=======================================
Expression profiling
=======================================
Expression clustering method:
Expression clusters for consensus clusters:
=======================================
Promoter shifting
=======================================
GroupX:
GroupY:
Shifting scores:
KS p-values (FDR adjusted):
Here, it is good to double check the reference genome and the sample labels. Also this gives us all the slots of the S4 object. So at this stage we have CTSS information as this is already the form of the data of the package. Only the tag count for each TSS for both samples, as the slot of Normalized tpm is empty. As well as the rest of the object.
With the following few standard CAGEr functions we will “fill in” these slots. And when full, show you how we can retrieve and export of these data.
First, we will look at the data globally between the samples. CAGEr has a function do this for you (shown below) to look at the correlation between the samples and plots the png in the working directory. We fill first source an adjusted version of this function to control the output by using source (Creating a new generic function for ‘plotCorrelation’).
corr.m <- plotCorrelation(myCAGEset, samples = "all", method = "pearson")
corr.m # correlation
zf_512cells zf_prim6
zf_512cells 1.0000000 0.4510721
zf_prim6 0.4510721 1.0000000
correlation
Next, we need to normalize our data to adjust for example different library sizes to make them comparable. Here, we’ll use the power-law based normalization (see @balwierz).
librarySizes(myCAGEset)
[1] 4155747 7668701
# normalisation first plot to assess alpha
plotReverseCumulatives(myCAGEset, fitInRange = c(5, 1000), onePlot = TRUE) # plot
# insert alpha in normalize function:
normalizeTagCount(myCAGEset, method = "powerLaw",fitInRange = c(5, 1000), alpha = 1.20, T = 1*10^6)
Now the CTSS information the normalized tpm is added:
myCAGEset
Let’s create tag clusters of CTSSs with the clusterCTSS function of CAGEr. CTSS in close proximity of each other give rise to functionally simialr set of transcripts within the same promoter elements. Tag clusters (TCs) are basically larger transcriptional units that correspond to individual promoters.
Here, we will use a simple distance measurement that individual CTSS can not be more than 20 bp apart. We also set the threshold at 1 which means that each TSS should have 2 or more tag counts in all the samples prior to clustering. Additionally, to remove low fidelity TSSs: no TC are called with a single TC if the normalized signal is below 5.
# Clustering of CTSS :)
clusterCTSS(object = myCAGEset, threshold = 1, thresholdIsTpm = TRUE,
nrPassThreshold = 1, method = "distclu", maxDist = 20,
removeSingletons = TRUE, keepSingletonsAbove = 5)
Keep in mind that these are determined per sample
Have a look again at the myCAGEset! The Tag cluster information slot is filled in.
Another feature we can check is the promoter width. That is, the width of each tag cluster per sample. CAGEr has a function that calculates the TC width by caculating cumulative distribution of tag signal per TC. Subsequently, promoter (TC) width is defined as the distance (in bp) between two quantiles generally set at 0.1 and 0.9 to capture 80% of tags.
# calculate cumlative CTSS distribution:
cumulativeCTSSdistribution(myCAGEset, clusters = "tagClusters")
# determine quantile positions:
quantilePositions(myCAGEset, clusters = "tagClusters", qLow = 0.1, qUp = 0.9)
We can then plot a histogram of the interquantile widths per sample with CAGEr too.
# we can plot the interquantile width with CAGEr too
plotInterquantileWidth(myCAGEset, clusters = "tagClusters",tpmThreshold = 3, qLow = 0.1, qUp = 0.9)
If you look at these two plots, you can see that the 512 cell stage more narrower TCs than prim6 which is what we would expect.
CAGEr vignette:
Tag clusters are often sample-specific, thus can be present in one sample but absent in another. In addition, in many cases tag clusters do not coincide perfectly within the same promoter region, or there might be two clusters in one sample and only one larger in the other. To be able to compare genome-wide transcriptional activity across samples and to perform expression profiling, a single set of consensus clusters needs to be created. This is done using aggregate-TagClusters function, which aggregates tag clusters from all samples into a single set of non-overlapping consensus clusters:
aggregateTagClusters(myCAGEset, tpmThreshold = 5, qLow = 0.1, qUp = 0.9, maxDist = 100)
As CAGE signals is essentially transcription, levels can be compared between groups. This can be done per CTSS, TC, or consensus clusters. CAGEr offers two methods to cluster gene expression, k-means and self-organising maps (SOM). Both require the number of expression cluster to be known in a priori.
Here we will perform the SOM algorithm at consensus cluster level. We set the threshold to at least 10 tpm (normalized) in at least one sample.
getExpressionProfiles(myCAGEset, what = "consensusClusters", tpmThreshold = 10,
nrPassThreshold = 1, method = "som", xDim = 3, yDim = 2)
# in a nice plot:
plotExpressionProfiles(myCAGEset, what = "consensusClusters")
Promoter shifting is a method described by Haberle et al in 2014. They’ve shown that the same promoter can be used differently in different samples. This method has been implemented in CAGEr. Shifting can be detected between two individual samples (or between two groups of samples that are merged per group)
For all promoters a shifting score is calculated based on the difference in the cumulative distribution of CAGE signal along that promoter in the two samples. In addition, a more general assessment of differential TSS usage is obtained by performing Kolmogorov-Smirnov test on the cumulative distributions of CAGE signal, as described below. Thus, prior to shifting score calculation and statistical testing, we have to calculate cumulative distribution along all consensus clusters: